ggerganov (Member)

depends on #7461

Start using more lightweight models (Pythia 1.4B and 2.8B vs OpenLlama 3B and 7B)

@mofosyne added the "Review Complexity : Medium" and "devops" labels on May 22, 2024
@ggerganov force-pushed the gg/ggml-ci-pythia branch from f0d9eda to 57496b2 on May 23, 2024 09:54
@ggerganov merged commit 55ac3b7 into master on May 23, 2024
@ggerganov deleted the gg/ggml-ci-pythia branch on May 23, 2024 12:28
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 535 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8701.94ms p(95)=21251.58ms fails=, finish reason: stop=482 truncated=53
  • Prompt processing (pp): avg=100.69tk/s p(95)=423.2tk/s
  • Token generation (tg): avg=49.42tk/s p(95)=47.08tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/ggml-ci-pythia commit=d2bae455466102487c4c5fce15f37e750c7b5756
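The figures in the summary above can be cross-checked with a few lines of Python. This is only a sketch: the `parse_params` helper is hypothetical, and the idea that the server splits `ctx-size` evenly across `parallel` slots is an assumption about the server configuration, not something the report states.

```python
# Figures copied from the benchmark summary above.
iterations = 535
finish_reasons = {"stop": 482, "truncated": 53}
params = ("parallel=8 ctx-size=16384 ngl=33 batch-size=2048 "
          "ubatch-size=256 pp=1024 pp+tg=2048")


def parse_params(text):
    """Split space-separated key=value pairs into a dict of ints (hypothetical helper)."""
    return {key: int(value) for key, value in (item.split("=", 1) for item in text.split())}


cfg = parse_params(params)

# Every iteration should end with exactly one finish reason.
assert sum(finish_reasons.values()) == iterations

# Assumption: the context window is divided evenly across the parallel slots.
ctx_per_slot = cfg["ctx-size"] // cfg["parallel"]
print(ctx_per_slot, cfg["pp+tg"])

# Share of requests that stopped because they hit the token limit.
truncated_share = finish_reasons["truncated"] / iterations
print(f"{truncated_share:.1%}")
```

Under that assumption each slot gets 2048 tokens of context, the same value as `pp+tg`, which would be consistent with roughly one request in ten ending as `truncated`.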

prompt_tokens_seconds

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 535 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1716479076 --> 1716479694
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 702.44, 702.44, 702.44, 702.44, 702.44, 719.54, 719.54, 719.54, 719.54, 719.54, 714.03, 714.03, 714.03, 714.03, 714.03, 728.92, 728.92, 728.92, 728.92, 728.92, 806.66, 806.66, 806.66, 806.66, 806.66, 804.93, 804.93, 804.93, 804.93, 804.93, 791.88, 791.88, 791.88, 791.88, 791.88, 806.38, 806.38, 806.38, 806.38, 806.38, 806.13, 806.13, 806.13, 806.13, 806.13, 802.68, 802.68, 802.68, 802.68, 802.68, 817.45, 817.45, 817.45, 817.45, 817.45, 840.58, 840.58, 840.58, 840.58, 840.58, 785.21, 785.21, 785.21, 785.21, 785.21, 784.11, 784.11, 784.11, 784.11, 784.11, 782.6, 782.6, 782.6, 782.6, 782.6, 769.98, 769.98, 769.98, 769.98, 769.98, 776.75, 776.75, 776.75, 776.75, 776.75, 775.48, 775.48, 775.48, 775.48, 775.48, 774.76, 774.76, 774.76, 774.76, 774.76, 796.49, 796.49, 796.49, 796.49, 796.49, 792.03, 792.03, 792.03, 792.03, 792.03, 795.21, 795.21, 795.21, 795.21, 795.21, 799.09, 799.09, 799.09, 799.09, 799.09, 801.44, 801.44, 801.44, 801.44, 801.44, 799.99, 799.99, 799.99, 799.99, 799.99, 800.7, 800.7, 800.7, 800.7, 800.7, 802.96, 802.96, 802.96, 802.96, 802.96, 819.41, 819.41, 819.41, 819.41, 819.41, 819.96, 819.96, 819.96, 819.96, 819.96, 818.24, 818.24, 818.24, 818.24, 818.24, 818.38, 818.38, 818.38, 818.38, 818.38, 824.71, 824.71, 824.71, 824.71, 824.71, 822.79, 822.79, 822.79, 822.79, 822.79, 824.93, 824.93, 824.93, 824.93, 824.93, 834.88, 834.88, 834.88, 834.88, 834.88, 842.23, 842.23, 842.23, 842.23, 842.23, 848.17, 848.17, 848.17, 848.17, 848.17, 841.45, 841.45, 841.45, 841.45, 841.45, 838.46, 838.46, 838.46, 838.46, 838.46, 838.87, 838.87, 838.87, 838.87, 838.87, 842.3, 842.3, 842.3, 842.3, 842.3, 842.78, 842.78, 842.78, 842.78, 842.78, 833.28, 833.28, 833.28, 833.28, 833.28, 830.98, 830.98, 830.98, 830.98, 830.98, 830.92, 830.92, 830.92, 830.92, 830.92, 830.5, 830.5, 830.5, 830.5, 830.5, 828.3, 828.3, 828.3, 828.3, 828.3, 825.34, 825.34, 825.34, 825.34, 825.34, 830.37, 830.37, 830.37, 830.37, 830.37, 
829.68, 829.68, 829.68, 829.68, 829.68, 830.56, 830.56, 830.56, 830.56, 830.56, 834.14, 834.14, 834.14, 834.14, 834.14, 836.78, 836.78, 836.78, 836.78, 836.78, 841.6, 841.6, 841.6, 841.6, 841.6, 841.51, 841.51, 841.51, 841.51, 841.51, 832.53, 832.53, 832.53, 832.53, 832.53, 833.5, 833.5, 833.5, 833.5, 833.5, 833.48, 833.48, 833.48, 833.48, 833.48, 835.2, 835.2, 835.2, 835.2, 835.2, 836.39, 836.39, 836.39, 836.39, 836.39]
                    
predicted_tokens_seconds
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 535 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1716479076 --> 1716479694
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 41.96, 41.96, 41.96, 41.96, 41.96, 38.48, 38.48, 38.48, 38.48, 38.48, 34.52, 34.52, 34.52, 34.52, 34.52, 27.89, 27.89, 27.89, 27.89, 27.89, 29.39, 29.39, 29.39, 29.39, 29.39, 29.46, 29.46, 29.46, 29.46, 29.46, 30.46, 30.46, 30.46, 30.46, 30.46, 31.51, 31.51, 31.51, 31.51, 31.51, 31.65, 31.65, 31.65, 31.65, 31.65, 32.0, 32.0, 32.0, 32.0, 32.0, 32.17, 32.17, 32.17, 32.17, 32.17, 32.3, 32.3, 32.3, 32.3, 32.3, 32.51, 32.51, 32.51, 32.51, 32.51, 31.73, 31.73, 31.73, 31.73, 31.73, 31.42, 31.42, 31.42, 31.42, 31.42, 30.55, 30.55, 30.55, 30.55, 30.55, 29.56, 29.56, 29.56, 29.56, 29.56, 29.23, 29.23, 29.23, 29.23, 29.23, 29.5, 29.5, 29.5, 29.5, 29.5, 29.43, 29.43, 29.43, 29.43, 29.43, 29.41, 29.41, 29.41, 29.41, 29.41, 29.77, 29.77, 29.77, 29.77, 29.77, 29.86, 29.86, 29.86, 29.86, 29.86, 30.16, 30.16, 30.16, 30.16, 30.16, 30.09, 30.09, 30.09, 30.09, 30.09, 30.01, 30.01, 30.01, 30.01, 30.01, 30.1, 30.1, 30.1, 30.1, 30.1, 30.16, 30.16, 30.16, 30.16, 30.16, 29.96, 29.96, 29.96, 29.96, 29.96, 30.04, 30.04, 30.04, 30.04, 30.04, 30.45, 30.45, 30.45, 30.45, 30.45, 30.47, 30.47, 30.47, 30.47, 30.47, 30.6, 30.6, 30.6, 30.6, 30.6, 30.71, 30.71, 30.71, 30.71, 30.71, 30.78, 30.78, 30.78, 30.78, 30.78, 30.68, 30.68, 30.68, 30.68, 30.68, 30.52, 30.52, 30.52, 30.52, 30.52, 30.26, 30.26, 30.26, 30.26, 30.26, 29.89, 29.89, 29.89, 29.89, 29.89, 30.15, 30.15, 30.15, 30.15, 30.15, 30.19, 30.19, 30.19, 30.19, 30.19, 30.32, 30.32, 30.32, 30.32, 30.32, 30.41, 30.41, 30.41, 30.41, 30.41, 30.34, 30.34, 30.34, 30.34, 30.34, 29.9, 29.9, 29.9, 29.9, 29.9, 29.82, 29.82, 29.82, 29.82, 29.82, 29.06, 29.06, 29.06, 29.06, 29.06, 28.51, 28.51, 28.51, 28.51, 28.51, 28.5, 28.5, 28.5, 28.5, 28.5, 28.44, 28.44, 28.44, 28.44, 28.44, 28.55, 28.55, 28.55, 28.55, 28.55, 28.56, 28.56, 28.56, 28.56, 28.56, 28.66, 28.66, 28.66, 28.66, 28.66, 28.66, 28.66, 28.66, 28.66, 28.66, 28.68, 28.68, 28.68, 28.68, 28.68, 28.73, 28.73, 28.73, 28.73, 28.73, 28.69, 28.69, 
28.69, 28.69, 28.69, 28.87, 28.87, 28.87, 28.87, 28.87, 28.9, 28.9, 28.9, 28.9, 28.9, 29.07, 29.07, 29.07, 29.07, 29.07]
                    

Details

kv_cache_usage_ratio

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 535 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1716479076 --> 1716479694
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.19, 0.19, 0.19, 0.19, 0.19, 0.49, 0.49, 0.49, 0.49, 0.49, 0.32, 0.32, 0.32, 0.32, 0.32, 0.12, 0.12, 0.12, 0.12, 0.12, 0.18, 0.18, 0.18, 0.18, 0.18, 0.24, 0.24, 0.24, 0.24, 0.24, 0.15, 0.15, 0.15, 0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.14, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.17, 0.17, 0.17, 0.17, 0.17, 0.21, 0.21, 0.21, 0.21, 0.21, 0.32, 0.32, 0.32, 0.32, 0.32, 0.25, 0.25, 0.25, 0.25, 0.25, 0.46, 0.46, 0.46, 0.46, 0.46, 0.37, 0.37, 0.37, 0.37, 0.37, 0.27, 0.27, 0.27, 0.27, 0.27, 0.15, 0.15, 0.15, 0.15, 0.15, 0.16, 0.16, 0.16, 0.16, 0.16, 0.11, 0.11, 0.11, 0.11, 0.11, 0.14, 0.14, 0.14, 0.14, 0.14, 0.19, 0.19, 0.19, 0.19, 0.19, 0.18, 0.18, 0.18, 0.18, 0.18, 0.14, 0.14, 0.14, 0.14, 0.14, 0.18, 0.18, 0.18, 0.18, 0.18, 0.21, 0.21, 0.21, 0.21, 0.21, 0.13, 0.13, 0.13, 0.13, 0.13, 0.32, 0.32, 0.32, 0.32, 0.32, 0.16, 0.16, 0.16, 0.16, 0.16, 0.15, 0.15, 0.15, 0.15, 0.15, 0.12, 0.12, 0.12, 0.12, 0.12, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.17, 0.17, 0.17, 0.17, 0.17, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.42, 0.42, 0.42, 0.42, 0.42, 0.33, 0.33, 0.33, 0.33, 0.33, 0.11, 0.11, 0.11, 0.11, 0.11, 0.09, 0.09, 0.09, 0.09, 0.09, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.32, 0.32, 0.32, 0.32, 0.32, 0.58, 0.58, 0.58, 0.58, 0.58, 0.58, 0.58, 0.58, 0.58, 0.58, 0.61, 0.61, 0.61, 0.61, 0.61, 0.43, 0.43, 0.43, 0.43, 0.43, 0.17, 0.17, 0.17, 0.17, 0.17, 0.29, 0.29, 0.29, 0.29, 0.29, 0.27, 0.27, 0.27, 0.27, 0.27, 0.14, 0.14, 0.14, 0.14, 0.14, 0.2, 0.2, 0.2, 0.2, 0.2, 0.28, 0.28, 0.28, 0.28, 0.28, 0.23, 0.23, 0.23, 0.23, 0.23, 0.18, 0.18, 0.18, 0.18, 0.18, 0.24, 0.24, 0.24, 0.24, 0.24, 0.13, 0.13, 0.13, 0.13, 0.13, 0.07, 0.07, 0.07, 0.07, 0.07, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13]
                    
requests_processing
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 535 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1716479076 --> 1716479694
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 6.0, 6.0, 6.0, 6.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0]
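To read numbers out of the chart sources above rather than eyeballing them, something like the following could be used. It is a rough sketch, not part of the benchmark tooling: the helper names are hypothetical, the `excerpt` is an abbreviated copy of the start of the prompt_tokens_seconds series, and the stride of 5 is an assumption based on each reading appearing five times in a row in the data.

```python
import re
from itertools import dropwhile
from statistics import mean


def parse_line_series(chart_text):
    """Return the floats inside the first `line [...]` of an xychart source."""
    match = re.search(r"line\s*\[(.*?)\]", chart_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no `line [...]` series found")
    return [float(tok) for tok in match.group(1).split(",") if tok.strip()]


def distinct_samples(values, repeat=5):
    """Keep one value per reading (assumed repeated `repeat` times) and
    drop the zero readings recorded before the first request."""
    samples = values[::repeat]
    return list(dropwhile(lambda v: v == 0.0, samples))


# Abbreviated excerpt of the prompt_tokens_seconds series, for illustration only.
excerpt = """
    line [0.0, 0.0, 0.0, 0.0, 0.0, 702.44, 702.44, 702.44, 702.44, 702.44,
719.54, 719.54, 719.54, 719.54, 719.54]
"""

samples = distinct_samples(parse_line_series(excerpt))
print(samples, mean(samples))

# The x-axis bounds are Unix timestamps, so their difference is the sampled window.
window_seconds = 1716479694 - 1716479076
print(window_seconds)
```

Only leading zeros are dropped, so a mid-run reading of 0.0 (as in requests_processing) is kept. The x-axis window works out to 618 seconds, slightly longer than the configured 10m duration.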
                    

Labels
devops (improvements to build systems and github actions)
Review Complexity : Medium (Generally require more time to grok but manageable by beginner to medium expertise level)
2 participants